Building Slovene WordNet
Abstract
A WordNet is a lexical database in which nouns, verbs, adjectives and adverbs are organized in a conceptual hierarchy that links semantically and lexically related concepts. Such semantic lexicons have become one of the most valuable resources for a wide range of NLP research and applications, such as semantic tagging, automatic word-sense disambiguation, information retrieval and document summarisation. Following the WordNet design developed for English at Princeton, WordNets for a number of other languages have been built in the past decade, taking the idea into the domain of multilingual processing. This paper reports on the prototype Slovene WordNet, which currently contains about 5,000 top-level concepts. The resource was translated automatically from the Serbian WordNet with the help of a bilingual dictionary; the synset literals were then ranked by their frequency of occurrence in a corpus, and the results were corrected manually. The paper presents the results obtained, discusses some problems encountered along the way, and points out possibilities for automated acquisition and refinement of synsets in the future.
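The translate-and-rank step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `translate_synset`, the dictionary and frequency structures, and the toy Serbian and Slovene entries are all assumptions made for illustration, and the manual correction step is not modelled.

```python
from collections import Counter

def translate_synset(source_literals, bilingual_dict, corpus_freq):
    """Return candidate target-language literals, most frequent first.

    source_literals: literals of one source-language (e.g. Serbian) synset
    bilingual_dict:  maps a source word to a list of target translations
    corpus_freq:     maps a target word to its corpus frequency
    """
    candidates = set()
    for literal in source_literals:
        # Words missing from the dictionary contribute no candidates
        candidates.update(bilingual_dict.get(literal, []))
    # Sort by descending corpus frequency; break ties alphabetically
    return sorted(candidates, key=lambda w: (-corpus_freq.get(w, 0), w))

# Toy data, invented for illustration only (not taken from the paper)
toy_dict = {"kuća": ["hiša", "dom"], "dom": ["dom", "domovina"]}
toy_freq = Counter({"hiša": 120, "dom": 300, "domovina": 45})

ranked = translate_synset(["kuća", "dom"], toy_dict, toy_freq)
print(ranked)  # ['dom', 'hiša', 'domovina']
```

In a workflow like the one the abstract describes, a ranked list of this kind would be the input to manual correction, where unsuitable candidates are removed by hand.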
Similar resources
Building the Slovene Wordnet: First Steps, First Problems
We report on the prototype Slovene wordnet which currently contains about 5,000 top-level concepts. The resource is based on the Serbian wordnet which has been automatically translated with the help of a bilingual dictionary, the literals ranked according to the frequency of corpus occurrence, and results manually corrected. The paper also discusses some problems encountered along the way and p...
Using Multilingual Resources for Building SloWNet Faster
This project report presents the results of an approach in which synsets for Slovene wordnet were induced automatically from parallel corpora and already existing wordnets. First, multilingual lexicons were obtained from word-aligned corpora and compared to the wordnets in various languages in order to disambiguate lexicon entries. Then appropriate synset ids were attached to Slovene entries fr...
A Multilingual Approach to Building Slovene Wordnet
The paper presents an experiment in which synsets for Slovene wordnet were induced automatically from several multilingual resources. Our research is based on the assumption that translations are a plausible source of semantically relevant information. More specifically, we argue that the translational relation on the one hand reduces ambiguity of a source word and on the other conveys semantic...
Enriching Slovene WordNet with domain-specific terms
The paper describes an innovative approach to expanding the domain coverage of wordnet by exploiting multiple resources. In the experiment described here we are using a large monolingual Slovene corpus of texts from the domain of informatics to harvest terminology from, and a parallel English-Slovene corpus and an online dictionary as bilingual resources to facilitate the mapping of terms to th...
Leveraging Parallel Corpora and Existing Wordnets for Automatic Construction of the Slovene Wordnet
The paper reports on a series of experiments conducted in order to test the feasibility of automatically generating synsets for Slovene wordnet. The resources used were the multilingual parallel corpus of George Orwell’s Nineteen Eighty-Four and wordnets for several languages. First, the corpus was word-aligned to obtain multilingual lexicons and then these lexicons were compared to the wordnet...